Endpoint and timeout fixes for sharded-CI flakes by rockbmb · Pull Request #621 · open-web3-stack/polkadot-ecosystem-tests

rockbmb · 2026-05-19T22:16:36Z

A batch of CI-reliability fixes surfaced by sharded test runs (see polkadot-fellows/runtimes#1180, which consumes PET via runtimes-master).

Endpoint changes

Drop wss://collectives.api.onfinality.io/public-ws from collectivesPolkadot. The public-tier endpoint returns -32029: Too Many Requests under sustained load.
Replace dead wss://us.bifrost-rpc.liebi.com/ws (the only configured endpoint for bifrostKusama) with hk. plus the no-region Liebi default.
Refresh KNOWN_GOOD_BLOCK_NUMBERS_*.env. The previous bump shipped a stale Bifrost Kusama fallback because the dead us. endpoint blocked yarn update-known-good.

Timeouts

defineChain.ts raises the per-chain timeout to 90s. SetupOption.timeout in chopsticks-utils only governs the test-side WsProvider; chopsticks' upstream WsProvider has a separate rpc-timeout that has no path from SetupOption. This PR bundles a .yarn/patches patch that adds an rpcTimeout field to SetupOption and forwards it as rpc-timeout, mirroring AcalaNetwork/chopsticks#1034. The patch can be dropped once a chopsticks release includes that change.

Exclusions

bifrostKusama.* and karura.bifrostKusama.xcm.test.ts: every public Bifrost Kusama RPC either rejects connections or prunes state at the pinned block. Excluded until a workable endpoint set exists.
acala.*.test.ts: Subway hardcodes its per-upstream request_timeout to 30s and doesn't expose it in ClientConfig, so heavy Acala storage queries force Subway to cycle through the 3 Liebi endpoints without serving a response. AcalaNetwork/subway#203 adds the missing field and is merged but pending a fresh tag with a working release artifact; the exclusion can be reverted once that lands.
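
As an illustration of how the exclusion rules above select files, here is a minimal sketch. The pattern strings are copied from the list above, but the `isExcluded` helper and its simplified `*`-only matcher are assumptions made for this example; they are not the contents of PET's `vitest.config.mts` and not vitest's own glob matcher.

```typescript
// Patterns taken from the exclusion list above. How they are actually
// written in vitest.config.mts (paths, prefixes) is not shown in this PR text.
const excludedSuites: string[] = [
  "bifrostKusama.*",
  "karura.bifrostKusama.xcm.test.ts",
  "acala.*.test.ts",
];

// Simplified glob matcher for illustration: only `*` is supported and it
// matches any run of characters. Everything else is matched literally.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Hypothetical helper: true when a test file name matches any exclusion.
function isExcluded(fileName: string): boolean {
  return excludedSuites.some((glob) => globToRegExp(glob).test(fileName));
}
```

With these patterns, a file named `acala.astar.xcm.test.ts` would be skipped, while a suite for an unrelated chain would still be collected.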


github-actions · 2026-05-19T22:17:38Z

No issues found.

Acala XCM tests (acala.astar, acala.bifrostPolkadot, etc.) hit the 60s timeout on every Acala endpoint Subway cycles through, on the same shard that surfaced the OnFinality rate-limit failure in polkadot-fellows/runtimes#1180. The Acala public RPC pool is slow enough under load that the heavy XCM-Transact storage queries don't return in 60s on any individual endpoint, so Subway burns the full timeout per upstream before rotating, never gets a response, and the test fails. 90s gives those queries enough headroom while still capping a genuinely stuck call. Block numbers are bumped at the same time to keep state lookups close to chain head.

`wss://us.bifrost-rpc.liebi.com/ws` was the only endpoint for bifrostKusama and is currently network-dead (handshake timeout, probed live). Test runs stall on bifrostKusama because Subway has no fallback to cycle to. The two replacements (`hk.` and the no-region default) both respond and serve state at current tip; the no-region host is Liebi's DNS-load-balanced entrypoint and adds geographic redundancy in case `hk.` ever goes the way of `us.`.

`ba34d62` shipped `BIFROSTKUSAMA_BLOCK_NUMBER=13903082` as a stale script fallback because the only configured Bifrost Kusama endpoint was unreachable at the time, and no public RPC retained that block's state. The previous commit fixes the endpoint; this re-runs `yarn update-known-good` against the live endpoint to record a block number that is actually servable, and refreshes every other chain's block in the same pass.

`defineChain.ts` already set `timeout: 90_000` in the per-chain chopsticks config, but `SetupOption.timeout` only controls the test-side WsProvider that talks to the in-process chopsticks server; it leaves chopsticks' own upstream WsProvider on its 60s default, which is what produces the `No response received from RPC endpoint in 60s` errors seen on Acala in the previous CI run on this branch. Bundles a yarn patch that adds an `rpcTimeout` field to `SetupOption` and forwards it as `rpc-timeout` in the chopsticks config (mirroring AcalaNetwork/chopsticks#1034), and sets it to 90s in `defineChain.ts`. The patch can be dropped once a chopsticks release includes #1034.
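
The forwarding described above can be sketched as follows. This is not the bundled patch or the chopsticks-utils source: `SetupOptionSketch` and `toChopsticksConfig` are hypothetical names, only the two timeout fields are modelled, the endpoint is a placeholder, and the sketch assumes `rpcTimeout` is expressed in milliseconds like `timeout: 90_000`.

```typescript
// Simplified stand-in for chopsticks-utils' SetupOption (hypothetical type).
interface SetupOptionSketch {
  endpoint: string | string[];
  timeout?: number; // test-side WsProvider timeout, in ms
  rpcTimeout?: number; // upstream WsProvider timeout, assumed to be in ms
}

// Hypothetical helper: builds the part of the chopsticks config that carries
// the upstream timeout. `timeout` is deliberately not forwarded, because it
// only governs the test-side provider. The `rpc-timeout` key is set only when
// a value is supplied, so the upstream default applies otherwise.
function toChopsticksConfig(option: SetupOptionSketch): Record<string, unknown> {
  const config: Record<string, unknown> = { endpoint: option.endpoint };
  if (option.rpcTimeout !== undefined) {
    config["rpc-timeout"] = option.rpcTimeout;
  }
  return config;
}

const sketchConfig = toChopsticksConfig({
  endpoint: "wss://example.invalid/ws", // placeholder, not a real endpoint
  timeout: 90_000,
  rpcTimeout: 90_000,
});
```

Keeping the two fields separate mirrors the point made above: raising `timeout` alone leaves the upstream provider on its 60s default.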

`wss://us.bifrost-rpc.liebi.com/ws` (the only configured endpoint until the previous commit) is network-dead, and the alternative Liebi hosts (`hk.`, no-region) only serve current-tip state. They don't retain the historical state at the block PET pins to, so chopsticks setup fails with `UnknownBlock: State already discarded` on every fresh shard. Until a public Bifrost Kusama endpoint retains state at our pinned block (or PET runs against an archive-quality endpoint), the four `bifrostKusama.*` E2E suites and the cross-chain `karura.bifrostKusama.xcm` suite are excluded from collection. The other Kusama suites are unaffected.

The chopsticks-side patch in this PR raised `rpcTimeout` to 90s, but Subway hardcodes its own per-upstream `request_timeout` to 30s (with no field exposed in `ClientConfig` to override it). Heavy Acala storage queries take longer than 30s, so Subway cycles through the 3 Liebi endpoints (~30s each) without serving a response, and chopsticks times out before the cycle completes. Excluding Acala suites until Subway exposes `request_timeout` as a config field.
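
The timing argument above can be checked with a small model. This is a sketch under stated assumptions, not Subway's or chopsticks' actual logic: `simulateCycle` is a hypothetical function, every upstream is assumed to need the same time for the query, and the 45s query duration used below is an invented figure chosen only because it exceeds 30s.

```typescript
interface CycleOutcome {
  served: boolean;
  elapsedSeconds: number;
}

// Models a proxy that tries each upstream in turn and abandons an attempt
// after `requestTimeout` seconds, while the caller gives up after
// `callerTimeout` seconds. All durations are in seconds.
function simulateCycle(
  queryDuration: number,
  requestTimeout: number,
  endpointCount: number,
  callerTimeout: number,
): CycleOutcome {
  let elapsed = 0;
  for (let i = 0; i < endpointCount; i++) {
    if (queryDuration <= requestTimeout) {
      // This upstream answers within its own budget.
      const finish = elapsed + queryDuration;
      return finish <= callerTimeout
        ? { served: true, elapsedSeconds: finish }
        : { served: false, elapsedSeconds: callerTimeout };
    }
    // The attempt is cut off and the proxy rotates to the next upstream.
    elapsed += requestTimeout;
    if (elapsed >= callerTimeout) {
      return { served: false, elapsedSeconds: callerTimeout };
    }
  }
  return { served: false, elapsedSeconds: elapsed };
}

// 30s per upstream, 3 endpoints, 90s caller timeout: never served.
const withDefault = simulateCycle(45, 30, 3, 90);
// Same query with a 90s per-upstream budget: served by the first upstream.
const withLongerBudget = simulateCycle(45, 90, 3, 90);
```

In this model a longer caller timeout does not help while the per-upstream budget stays below the query duration, which is why the chopsticks-side change alone was not enough.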

* Expose per-upstream client timeouts and retries in `ClientConfig`

  `Client::new` already accepts `request_timeout`, `connection_timeout`, and `retries` arguments, but `from_config` hardcodes all three to `None` because `ClientConfig` only exposes `endpoints` and `shuffle_endpoints`. As a result the only way to override the 30s per-upstream request timeout (and the 30s connection timeout, and the default retry count) is to construct `Client` directly in Rust, which isn't reachable from the YAML-driven config.

  Adds three optional fields to `ClientConfig`:

  - `request_timeout_seconds`
  - `connection_timeout_seconds`
  - `retries`

  `from_config` plumbs them into `Client::new`. None of the existing defaults change when the fields are omitted.

  The motivating case is heavy storage queries against slow public RPCs (Acala under load is the case that surfaced this in `polkadot-fellows/runtimes#1180` / `open-web3-stack/polkadot-ecosystem-tests#621`) where 30s per upstream is not enough and Subway exhausts its endpoint cycle without serving a response.

* cargo fmt

* feat(bench): Add client config options for connection timeout, request timeout, and retries

---------

Co-authored-by: Bryan Chen <xlchen1291@gmail.com>


…pstream timeout (#622)

* Install Subway from upstream `v0.1.0` musl release in `ci.yml`

  Switches `cargo install --git` to a `curl | tar -xz` of the released static binary (https://github.com/AcalaNetwork/subway/releases/tag/v0.1.0, published by AcalaNetwork/subway#202). Removes the Rust toolchain install, Subway-HEAD commit-hash lookup, and Swatinem cache layer that existed only to amortise the `cargo install` cost; none of them have any other consumer in this workflow.

* Install Subway from upstream `v0.1.0` musl release in `update-known-good.yml`

  Same swap as the previous commit, applied to the periodic block-number update workflow.

* Install Subway from upstream `v0.1.0` musl release in `update-snapshot.yml`

  Same swap as the previous two commits, applied to the snapshot-update workflow.

* Fail Subway download fast on HTTP errors (`curl -f`)

  Without `-f`, an HTTP 4xx/5xx response (e.g. release deleted, GitHub degraded) leaves `curl` exiting zero with the error body on stdout, and the downstream `tar -xz` fails with a confusing "not in gzip format" message instead. Per review on PR #622.

* Install Subway by extracting binary from `acala/subway:v0.1.1` Docker image

  The `v0.1.1` GitHub Release at AcalaNetwork/subway is missing its `x86_64-unknown-linux-musl.tar.gz` asset; the release workflow's `Build release binary` step failed (`cargo build --locked` mismatched the bumped `Cargo.toml` version), so the upload was skipped. The upstream tag still produces a working Docker image because `docker.yml` doesn't use `--locked`, so `acala/subway:v0.1.1` is the only working consumption path for v0.1.1.

  The image's binary lives at `/usr/local/bin/subway` (per Subway's Dockerfile); copying it out with `docker create` + `docker cp` lands in roughly the same wall time as the curl-and-untar path and unblocks consumption of PR #203's `request_timeout_seconds` config field.

* Set Subway per-upstream `request_timeout_seconds` to 90s

  Subway's default per-upstream request timeout is 30s. With three Acala public RPC endpoints, heavy storage queries that take longer than 30s cause Subway to cycle through all three endpoints (~90s) before any single upstream has a chance to respond, and the test-side waiting client times out.

  `request_timeout_seconds` was added to `ClientConfig` in AcalaNetwork/subway#203 (Subway v0.1.1+). Setting it to 90 lets a single upstream attempt run long enough to complete those queries instead of being preempted by Subway's own per-endpoint clock.

  The companion exclusion of Acala tests in `vitest.config.mts` is intentionally left in place; this commit only restores Subway's ability to wait long enough. Lifting the exclusion is a separate verification step.

* Re-enable Acala test suites

  `request_timeout_seconds: 90` on Subway's upstream client (added to `subway-template.yml` in the previous commit) gives Subway enough time per upstream attempt for Acala storage queries to land before the 30s default forced it to cycle endpoints. The exclusion added in PR #621 is no longer needed and is removed; the exclusion comment is narrowed to bifrostKusama, which still lacks a workable endpoint set.

Drop public OnFinality endpoint for collectivesPolkadot

de77939

The public-tier endpoint is rate-limited (RPC error -32029, "Please apply an OnFinality API key") under the sustained load produced by sharded CI runs, observed in polkadot-fellows/runtimes#1180.

rockbmb self-assigned this May 19, 2026

rockbmb added the ci label May 19, 2026

rockbmb requested a review from xlc May 19, 2026 23:35

rockbmb added 2 commits May 20, 2026 00:07

rockbmb changed the title ~~Drop public OnFinality endpoint for collectivesPolkadot~~ Endpoint and timeout fixes for sharded-CI flakes May 20, 2026

rockbmb mentioned this pull request May 20, 2026

Forward rpcTimeout from SetupOption to chopsticks config AcalaNetwork/chopsticks#1034

Merged

rockbmb added 3 commits May 20, 2026 01:08

rockbmb mentioned this pull request May 20, 2026

Expose per-upstream client timeouts and retries in ClientConfig AcalaNetwork/subway#203

Merged

xlc approved these changes May 20, 2026


rockbmb force-pushed the drop-onfinality-collectives branch from e99a143 to dbb5d6d Compare May 20, 2026 12:30

Merge remote-tracking branch 'origin/master' into drop-onfinality-collectives

f6c715a

rockbmb force-pushed the drop-onfinality-collectives branch from dbb5d6d to f6c715a Compare May 20, 2026 12:37

rockbmb merged commit 63a09c8 into master May 20, 2026
13 checks passed

rockbmb deleted the drop-onfinality-collectives branch May 20, 2026 13:08

rockbmb mentioned this pull request May 20, 2026

Install Subway from acala/subway:v0.1.1 Docker image; set 90s per-upstream timeout #622

Merged

rockbmb merged 8 commits into master from drop-onfinality-collectives
